Kernel-Free Quadratic Surface Regression for Multi-Class Classification

For multi-class classification problems, we present a new kernel-free nonlinear classifier, the hard quadratic surface least squares regression (HQSLSR). It combines the benefits of the least squares loss function and the quadratic kernel-free trick. The optimization problem of HQSLSR is convex and unconstrained, making it easy to solve. Further, to improve the generalization ability of HQSLSR, a softened version (SQSLSR) is proposed by introducing an ε-dragging technique, which enlarges the between-class distance. The optimization problem of SQSLSR is solved by an alternating iteration algorithm. The convergence, interpretability, and computational complexity of our methods are addressed in a theoretical analysis. Visualization results on five artificial datasets demonstrate the geometric diversity of the regression function obtained for each class and the advantage of the ε-dragging technique. Furthermore, experimental results on benchmark datasets show that our methods perform comparably to some state-of-the-art classifiers.


Introduction
Consider a training set
$$T_1 = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \tag{1}$$
comprising $n$ samples, each represented by a $d$-dimensional vector $\mathbf{x}_i \in \mathbb{R}^d$ and a corresponding label $y_i \in \{1, 2, \cdots, K\}$ indicating which of the $K$ classes the sample belongs to. For multi-class classification, one popular strategy is to encode each label using one-hot encoding. Consequently, the original training set $T_1$ (1) is transformed into a new training set
$$T_2 = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}, \tag{2}$$
where each sample corresponds to a label vector $\mathbf{y}_i = \text{one-hot}(y_i)$ (Definition 3). Our goal is to find $K$ functions $f_k(\mathbf{x})$, $k = 1, 2, \cdots, K$, that satisfy $\mathbf{f}(\mathbf{x}_i) \approx \mathbf{y}_i$, where $\mathbf{f}(\mathbf{x}_i) = (f_1(\mathbf{x}_i), f_2(\mathbf{x}_i), \cdots, f_K(\mathbf{x}_i))^T$ for $i = 1, 2, \cdots, n$. Once these $K$ functions are determined, a new sample $\mathbf{x}$ can be classified using the decision rule
$$g(\mathbf{x}) = \arg\max_{k = 1, 2, \cdots, K} f_k(\mathbf{x}).$$
In recent years, numerous studies have focused on the multi-class classification problem. In 1994, Imran Naseem et al. [1,2] proposed the original least squares regression classifier (LSR) based on the label vectors. This method assigns input samples to the class represented by the label vector closest to the predicted vector. To improve the accuracy of LSR, Xiang et al. [3] introduced the ε-dragging technique to expand the interval between different classes, creating a discriminative LSR (DLSR). Zhang et al. [4] proposed a retargeted LSR (ReLSR), which learns soft labels with large margin constraints directly from training data. Wen et al. [5] proposed an inter-class sparsity DLSR (ICS_DLSR) by introducing inter-class sparsity constraints. Wang et al. [6] proposed a relaxed group low-rank regression model (RGLRR) that incorporates sparsity consistency and graph embedding into the group low-rank regression model.
Recently, scholars have proposed several methods to improve the classification accuracy of DLSR, including the margin scalable DLSR (MSDLSR) [7], the robust DLSR (RODLSR) [8], regularized label relaxation linear regression (RLRLR) [9], low-rank DLSR (LRDLSR) [10], and discriminative least squares regression based on within-class scatter minimization (WCSDLSR) [11]. To improve the classification accuracy of ReLSR, Zhang et al. [12] introduced the intra-class compactness graph into ReLSR, proposing the discriminative marginalized LSR (DMLSR). Additionally, LSR has been extended for feature selection by Zhang et al. [13] and Zhao et al. [14]. All of the above methods are linear classification models, which are fast to compute but have difficulty handling nonlinearly separable data. The kernel ridge regression classifier (KRR) was proposed to address this limitation by using the kernel trick [15,16]. However, it is challenging to select an appropriate kernel function and kernel parameter.
In this paper, we propose two nonlinear classification models, the hard quadratic surface least squares regression (HQSLSR) and its softened version, the soft quadratic surface least squares regression (SQSLSR). The main contributions of this work are summarized as follows:
(1) We propose a novel nonlinear model (HQSLSR) that uses a kernel-free trick, which avoids the difficulty of selecting appropriate kernel functions and corresponding parameters and maintains good interpretability. Moreover, a softened version (SQSLSR) is developed, which employs the ε-dragging technique to enlarge inter-class distances so that its discriminant ability is improved further.
(2) The proposed HQSLSR yields a convex optimization problem without constraints, which can be solved directly. An alternating iteration algorithm is designed for SQSLSR, which involves only convex optimization subproblems and converges quickly. Additionally, the computational complexity and interpretability of our methods are discussed.
(3) In numerical experiments, the geometric intuition of our methods and the advantage of the ε-dragging technique are demonstrated on artificial datasets. The experimental results on benchmark datasets show that our methods achieve accuracy comparable to other nonlinear classifiers while requiring less computation time.
This paper is organized as follows. Section 2 briefly describes related work. Section 3 presents the proposed HQSLSR and SQSLSR models and their respective algorithms. Section 4 discusses relevant characteristics. Section 5 presents experimental results, and finally, we conclude in Section 6.

Related Works
In this section, following the presentation of notations, we provide a concise introduction to two fundamental approaches: least squares regression classifiers (LSR) [1] and discriminative least squares regression classifiers (DLSR) [3].

Notations
We begin by presenting the notations employed in this paper. Lowercase boldface and uppercase boldface fonts represent vectors and matrices, respectively. The vector $(1, 1, \cdots, 1)^T \in \mathbb{R}^n$ is denoted by $\mathbf{1}_n$. The zero vector and the null matrix are denoted by $\mathbf{0}$ and $\mathbf{O}$, respectively. For a matrix $\mathbf{W} = (w_{ij})_{d \times K}$, its $i$-th column is denoted by $\mathbf{w}_i$. In addition, we give the following three definitions.
Definition 1. For any real symmetric matrix $\mathbf{A} = (a_{ij})_{d \times d} \in \mathbb{S}^d$, its half-vectorization operator is defined as
$$\mathrm{hvec}(\mathbf{A}) = (a_{11}, a_{12}, \cdots, a_{1d}, a_{22}, \cdots, a_{2d}, \cdots, a_{dd})^T \in \mathbb{R}^{\frac{d^2+d}{2}}.$$
Definition 2. For any vector $\mathbf{x} = (x_1, x_2, \cdots, x_d)^T$, its quadratic vector with cross terms is defined as
$$\mathrm{lvec}(\mathbf{x}) = \left(\tfrac{1}{2}x_1^2, x_1 x_2, \cdots, x_1 x_d, \tfrac{1}{2}x_2^2, x_2 x_3, \cdots, \tfrac{1}{2}x_d^2\right)^T \in \mathbb{R}^{\frac{d^2+d}{2}}.$$
Definition 3. For any given positive integer $k \in \{1, 2, \cdots, K\}$, the one-hot encoding operator is defined as
$$\text{one-hot}(k) = \mathbf{e}_k,$$
where $\mathbf{e}_k$ is the $K$-dimensional unit vector whose $k$-th element is 1.
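The three operators can be illustrated with a minimal NumPy sketch. It assumes the common convention in which squared terms of the quadratic vector carry a factor 1/2, so that (1/2) x^T A x equals the inner product of the half-vectorization of A and the quadratic vector of x; the function names `hvec`, `lvec`, and `one_hot` are chosen here for illustration.

```python
import numpy as np

def hvec(A):
    """Upper-triangular entries of a symmetric matrix, read row by row."""
    A = np.asarray(A, dtype=float)
    rows, cols = np.triu_indices(A.shape[0])
    return A[rows, cols]

def lvec(x):
    """Quadratic vector with cross terms, in the same order as hvec.

    Assumed convention: squares carry a factor 1/2, cross terms a factor 1.
    """
    x = np.asarray(x, dtype=float)
    rows, cols = np.triu_indices(x.size)
    scale = np.where(rows == cols, 0.5, 1.0)
    return scale * x[rows] * x[cols]

def one_hot(k, K):
    """K-dimensional unit vector e_k for a label k in {1, ..., K}."""
    e = np.zeros(K)
    e[k - 1] = 1.0
    return e

# Small check on a 2 x 2 symmetric matrix (illustrative values only).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, 2.0])
quad_direct = 0.5 * x @ A @ x      # (1/2) x^T A x
quad_lifted = hvec(A) @ lvec(x)    # hvec(A)^T lvec(x)
print(quad_direct, quad_lifted)
```

Under this convention the two printed values agree, which is what allows a quadratic function of x to be treated as a linear function of the lifted vector.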

Least Squares Regression Classifier
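Given a training set $T_2$ (2), the goal of LSR is to find the following $K$ linear functions:
$$f_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + c_k, \quad k = 1, 2, \cdots, K, \tag{4}$$
where $\mathbf{w}_k \in \mathbb{R}^d$ and $c_k \in \mathbb{R}$. To obtain the $K$ linear functions (4), the following optimization problem is formulated:
$$\min_{\mathbf{W}, \mathbf{c}} \ \|\mathbf{X}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - \mathbf{Y}\|_F^2 + \lambda \|\mathbf{W}\|_F^2, \tag{5}$$
where the sample matrix $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n) \in \mathbb{R}^{d \times n}$ is formed by all the samples in the training set $T_2$ (2), the label matrix $\mathbf{Y} = (\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_n)^T \in \mathbb{R}^{n \times K}$ is formed by the label vectors in $T_2$ (2), and $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_K) \in \mathbb{R}^{d \times K}$, $\mathbf{c} = (c_1, c_2, \cdots, c_K)^T \in \mathbb{R}^K$ are formed by the normal vectors and biases of the $K$ linear functions (4), respectively. Clearly, the optimization problem (5) is convex, and its solution has the following form:
$$\mathbf{W} = (\mathbf{X}\mathbf{H}\mathbf{X}^T + \lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{H}\mathbf{Y}, \quad \mathbf{c} = \frac{1}{n}\left(\mathbf{Y}^T \mathbf{1}_n - \mathbf{W}^T \mathbf{X} \mathbf{1}_n\right),$$
where $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$. Thus, once the solution $\mathbf{W}, \mathbf{c}$ of the optimization problem (5) is obtained, we can find the $K$ linear functions.
For a new sample $\mathbf{x} \in \mathbb{R}^d$, its class is obtained by the following decision function:
$$g(\mathbf{x}) = \arg\max_{k = 1, 2, \cdots, K} \left(\mathbf{w}_k^T \mathbf{x} + c_k\right).$$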
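As a rough illustration of the LSR pipeline, the following NumPy sketch fits the K linear functions in closed form and applies the argmax decision rule. It assumes a ridge-regularized least squares objective with an unpenalized bias, whose solution involves the centering matrix H = I - (1/n) 1 1^T; the function names, the value of `lam`, and the toy data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def fit_lsr(X, labels, K, lam=1.0):
    """Closed-form LSR sketch. X is d x n, labels are integers in 1..K."""
    d, n = X.shape
    Y = np.eye(K)[np.asarray(labels) - 1]         # n x K one-hot label matrix
    ones = np.ones(n)
    H = np.eye(n) - np.outer(ones, ones) / n      # centering matrix
    W = np.linalg.solve(X @ H @ X.T + lam * np.eye(d), X @ H @ Y)
    c = (Y.T @ ones - W.T @ (X @ ones)) / n       # bias vector, length K
    return W, c

def predict_lsr(X, W, c):
    """Assign each column of X to the class with the largest regression output."""
    scores = X.T @ W + c                          # n x K, c is broadcast over rows
    return np.argmax(scores, axis=1) + 1

# Toy data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-3.0, 0.5, size=(2, 20)),
               rng.normal(3.0, 0.5, size=(2, 20))])
labels = np.array([1] * 20 + [2] * 20)
W, c = fit_lsr(X, labels, K=2)
accuracy = np.mean(predict_lsr(X, W, c) == labels)
print(accuracy)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical design choice; it solves the same linear system.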

Discriminative Least Squares Regression Classifier
Xiang et al. [3] proposed the discriminative least squares regression classifier (DLSR) to improve the classification performance of LSR.
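For the training set $T_2$ (2), we define the constant matrix $\mathbf{B} = (B_{ik})_{n \times K}$ as follows:
$$B_{ik} = \begin{cases} +1, & \text{if } y_{ik} = 1, \\ -1, & \text{otherwise}, \end{cases} \tag{7}$$
where $y_{ik}$ represents the $k$-th component of the label vector $\mathbf{y}_i$ of the $i$-th sample. The optimization problem of DLSR is formulated as follows:
$$\min_{\mathbf{W}, \mathbf{c}, \mathbf{E}} \ \|\mathbf{X}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - (\mathbf{Y} + \mathbf{B} \odot \mathbf{E})\|_F^2 + \lambda \|\mathbf{W}\|_F^2, \quad \text{s.t.} \ \mathbf{E} \geq \mathbf{O}, \tag{8}$$
where $\odot$ is the Hadamard product of matrices, $\mathbf{E} = (\varepsilon_{ik})_{n \times K}$ is an ε-dragging matrix to be found, and each of its non-negative elements $\varepsilon_{ik}$ is called an ε-dragging factor. DLSR thus builds on LSR by taking the inter-class distance into account. Specifically, DLSR increases inter-class distances by introducing the ε-dragging technique, which causes the regression targets of different classes to move in opposite directions.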
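The effect of the ε-dragging technique on the regression targets can be seen in a small NumPy sketch. The dragging factors below are fixed by hand at a hypothetical value of 0.2 purely for illustration; in DLSR they are variables of the optimization problem.

```python
import numpy as np

labels = np.array([1, 2, 3, 1])              # toy labels in {1, 2, 3}
K = 3
Y = np.eye(K)[labels - 1]                    # n x K one-hot label matrix

# Direction matrix: +1 where the sample belongs to the class, -1 elsewhere.
B = np.where(Y == 1, 1.0, -1.0)

# Hypothetical non-negative dragging factors (learned in the actual model).
E = np.full_like(Y, 0.2)

targets = Y + B * E                          # relaxed regression targets
gap_before = np.linalg.norm(Y[0] - Y[1])     # distance between class 1 and class 2 targets
gap_after = np.linalg.norm(targets[0] - targets[1])
print(targets[0], gap_before, gap_after)
```

The target of the true class is pushed above 1 while the others are pushed below 0, so targets of different classes end up farther apart than the original one-hot vectors.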

Kernel-Free Nonlinear Least Squares Regression Classifiers
For multi-class classification problems with the training set T 2 (2), we propose the hard quadratic surface least squares regression classifier (HQSLSR) and its softened version (SQSLSR). The relevant properties of our methods are also analyzed theoretically.

Hard Quadratic Surface Least Squares Regression Classifier
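For the training set $T_2$ (2), we aim to find $K$ quadratic functions as follows:
$$f_k(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A}_k \mathbf{x} + \mathbf{b}_k^T \mathbf{x} + c_k, \quad k = 1, 2, \cdots, K, \tag{9}$$
where $\mathbf{A}_k \in \mathbb{S}^d$, $\mathbf{b}_k \in \mathbb{R}^d$, $c_k \in \mathbb{R}$. Once these $K$ quadratic functions are found, the label of a new sample $\mathbf{x}$ is determined by the following decision rule:
$$g(\mathbf{x}) = \arg\max_{k = 1, 2, \cdots, K} f_k(\mathbf{x}). \tag{10}$$
In order to find the $K$ quadratic functions (9), we construct the following optimization problem:
$$\min_{\mathbf{A}_k, \mathbf{b}_k, c_k} \ \sum_{k=1}^{K}\sum_{i=1}^{n}\left(\frac{1}{2}\mathbf{x}_i^T \mathbf{A}_k \mathbf{x}_i + \mathbf{b}_k^T \mathbf{x}_i + c_k - y_{ik}\right)^2 + \lambda \sum_{k=1}^{K}\left(\|\mathrm{hvec}(\mathbf{A}_k)\|_2^2 + \|\mathbf{b}_k\|_2^2\right), \tag{11}$$
where $\lambda$ is the regularization parameter, $\mathrm{hvec}(\mathbf{A}_k)$ is the vector of Definition 1, formed by the upper triangular elements of the symmetric matrix $\mathbf{A}_k$, and $y_{ik}$ indicates the $k$-th component of the label vector $\mathbf{y}_i$ of the $i$-th sample. In the objective function (11), the first term minimizes the sum of squared errors between the real and predicted labels; the second term is a regularization term on the model coefficients, which aims to enhance the generalization ability of our model. Note that, because $\mathbf{A}_k$ is symmetric, only its upper triangular elements rather than all of its elements enter the regularization term.
For convenience, by using the symmetry of the matrix $\mathbf{A}_k$ and following Definitions 1 and 2, the first term of the objective function in the optimization problem (11) is simplified as follows:
$$\frac{1}{2}\mathbf{x}_i^T \mathbf{A}_k \mathbf{x}_i + \mathbf{b}_k^T \mathbf{x}_i + c_k = \mathrm{hvec}(\mathbf{A}_k)^T \mathrm{lvec}(\mathbf{x}_i) + \mathbf{b}_k^T \mathbf{x}_i + c_k = \mathbf{w}_k^T \mathbf{z}_i + c_k, \tag{12}$$
where $\mathrm{lvec}(\mathbf{x}_i)$ is the quadratic vector with cross terms of Definition 2, $\mathbf{z}_i = (\mathrm{lvec}(\mathbf{x}_i)^T, \mathbf{x}_i^T)^T \in \mathbb{R}^{\frac{d^2+3d}{2}}$, and
$$\mathbf{w}_k = (\mathrm{hvec}(\mathbf{A}_k)^T, \mathbf{b}_k^T)^T \in \mathbb{R}^{\frac{d^2+3d}{2}}. \tag{13}$$
By Equation (13), the regularization term becomes
$$\sum_{k=1}^{K}\left(\|\mathrm{hvec}(\mathbf{A}_k)\|_2^2 + \|\mathbf{b}_k\|_2^2\right) = \sum_{k=1}^{K}\|\mathbf{w}_k\|_2^2 = \|\mathbf{W}\|_F^2. \tag{14}$$
Furthermore, combining Equation (12), the optimization problem (11) is further formulated as
$$\min_{\mathbf{W}, \mathbf{c}} \ \|\mathbf{Z}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - \mathbf{Y}\|_F^2 + \lambda \|\mathbf{W}\|_F^2, \tag{15}$$
where $\mathbf{Z} = (\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_n) \in \mathbb{R}^{\frac{d^2+3d}{2} \times n}$, $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_K) \in \mathbb{R}^{\frac{d^2+3d}{2} \times K}$, $\mathbf{c} = (c_1, c_2, \cdots, c_K)^T \in \mathbb{R}^K$, and $\mathbf{Y} \in \mathbb{R}^{n \times K}$ is the label matrix. Next, the solution of the optimization problem (15) is given by the following theorem.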

Theorem 1.
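The optimal solution of the optimization problem (15) is as follows:
$$\mathbf{W} = (\mathbf{Z}\mathbf{H}\mathbf{Z}^T + \lambda \mathbf{I})^{-1}\mathbf{Z}\mathbf{H}\mathbf{Y}, \tag{16}$$
$$\mathbf{c} = \frac{1}{n}\left(\mathbf{Y}^T \mathbf{1}_n - \mathbf{W}^T \mathbf{Z} \mathbf{1}_n\right), \tag{17}$$
where $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$.
Proof. Problem (15) is a convex and unconstrained optimization problem. Denote its objective function by $J_1(\mathbf{W}, \mathbf{c})$. According to the optimality condition of the unconstrained optimization problem, we have
$$\frac{\partial J_1}{\partial \mathbf{c}} = 2\left(\mathbf{W}^T \mathbf{Z} \mathbf{1}_n + n\mathbf{c} - \mathbf{Y}^T \mathbf{1}_n\right) = \mathbf{0}, \tag{18}$$
$$\frac{\partial J_1}{\partial \mathbf{W}} = 2\mathbf{Z}\left(\mathbf{Z}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - \mathbf{Y}\right) + 2\lambda \mathbf{W} = \mathbf{O}. \tag{19}$$
According to Equation (18), we obtain
$$\mathbf{c} = \frac{1}{n}\left(\mathbf{Y}^T \mathbf{1}_n - \mathbf{W}^T \mathbf{Z} \mathbf{1}_n\right). \tag{20}$$
Substituting Equation (20) into Equation (19) gives $\mathbf{Z}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - \mathbf{Y} = \mathbf{H}\mathbf{Z}^T \mathbf{W} - \mathbf{H}\mathbf{Y}$, and therefore
$$\mathbf{W} = (\mathbf{Z}\mathbf{H}\mathbf{Z}^T + \lambda \mathbf{I})^{-1}\mathbf{Z}\mathbf{H}\mathbf{Y}, \tag{21}$$
where $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$. This completes the proof.
After solving the optimization problem (15) by Theorem 1, $\mathbf{w}_k$ and $c_k$ are obtained as the $k$-th column of the matrix $\mathbf{W}$ and the $k$-th component of the vector $\mathbf{c}$, respectively. Then, $\mathbf{A}_k$ and $\mathbf{b}_k$ can be recovered by Equation (13). Therefore, the decision function in Equation (10) can be established.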
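The closed-form solution of Theorem 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the quadratic functions are taken as (1/2) x^T A x + b^T x + c, squared terms in the lifted features carry a factor 1/2, and the helper names (`quad_features`, `fit_hqslsr`, `predict_hqslsr`, `unpack`), the value of `lam`, and the toy data are all illustrative.

```python
import numpy as np

def quad_features(X):
    """Lift d x n samples to l x n features z = (lvec(x); x), l = (d^2 + 3d) / 2."""
    d = X.shape[0]
    rows, cols = np.triu_indices(d)
    scale = np.where(rows == cols, 0.5, 1.0)[:, None]   # 1/2 on squared terms
    return np.vstack([scale * X[rows] * X[cols], X])

def fit_hqslsr(X, labels, K, lam=1e-2):
    """Solve the regularized least squares problem in the lifted space."""
    Z = quad_features(X)
    l, n = Z.shape
    Y = np.eye(K)[np.asarray(labels) - 1]               # n x K one-hot labels
    ones = np.ones(n)
    H = np.eye(n) - np.outer(ones, ones) / n            # centering matrix
    W = np.linalg.solve(Z @ H @ Z.T + lam * np.eye(l), Z @ H @ Y)
    c = (Y.T @ ones - W.T @ (Z @ ones)) / n
    return W, c

def predict_hqslsr(X, W, c):
    """Argmax decision rule over the K quadratic regression functions."""
    return np.argmax(quad_features(X).T @ W + c, axis=1) + 1

def unpack(w, d):
    """Recover the symmetric matrix A_k and the vector b_k from w_k."""
    rows, cols = np.triu_indices(d)
    m = rows.size
    A = np.zeros((d, d))
    A[rows, cols] = w[:m]
    A[cols, rows] = w[:m]
    return A, w[m:]

# Toy data: two noisy concentric circles, not linearly separable.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])
radii = radii + rng.normal(0.0, 0.1, size=200)
X = np.vstack([radii * np.cos(angles), radii * np.sin(angles)])
labels = np.array([1] * 100 + [2] * 100)

W, c = fit_hqslsr(X, labels, K=2)
accuracy = np.mean(predict_hqslsr(X, W, c) == labels)
A1, b1 = unpack(W[:, 0], d=2)
print(accuracy)
```

Because the model is linear in the lifted features, training reduces to a single linear solve, and the recovered pair (A1, b1) can be inspected directly.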

Soft Quadratic Surface Least Squares Regression Classifier
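In this subsection, we propose the SQSLSR by introducing the ε-dragging factor into the HQSLSR. For the training set $T_2$ (2), the following optimization problem is constructed:
$$\min_{\mathbf{A}_k, \mathbf{b}_k, c_k, \varepsilon_{ik}} \ \sum_{k=1}^{K}\sum_{i=1}^{n}\left(\frac{1}{2}\mathbf{x}_i^T \mathbf{A}_k \mathbf{x}_i + \mathbf{b}_k^T \mathbf{x}_i + c_k - (y_{ik} + B_{ik}\varepsilon_{ik})\right)^2 + \lambda \sum_{k=1}^{K}\left(\|\mathrm{hvec}(\mathbf{A}_k)\|_2^2 + \|\mathbf{b}_k\|_2^2\right), \quad \text{s.t.} \ \varepsilon_{ik} \geq 0, \tag{22}$$
where $\mathbf{A}_k$, $\mathbf{b}_k$, $c_k$, $\varepsilon_{ik}$, $i = 1, 2, \cdots, n$, $k = 1, 2, \cdots, K$, are the variables to be found, $\varepsilon_{ik} \geq 0$ is the ε-dragging factor, and the constant $B_{ik}$ is defined in Equation (7). The ε-dragging factor expands the distance between the label vectors of different classes. Therefore, compared with the HQSLSR model, the SQSLSR model distinguishes samples from different classes more easily. For simplicity, by defining the ε-dragging matrix $\mathbf{E} = (\varepsilon_{ik})_{n \times K}$ and proceeding as in the transformation of the optimization problem (11), the optimization problem (22) is equivalently expressed as follows:
$$\min_{\mathbf{W}, \mathbf{c}, \mathbf{E}} \ \|\mathbf{Z}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - (\mathbf{Y} + \mathbf{B} \odot \mathbf{E})\|_F^2 + \lambda \|\mathbf{W}\|_F^2, \quad \text{s.t.} \ \mathbf{E} \geq \mathbf{O}, \tag{23}$$
where $\mathbf{E} \geq \mathbf{O}$ means that the elements of the matrix $\mathbf{E}$ are non-negative. To solve the optimization problem (23), we use the alternating iteration method.
First, update $\mathbf{W}$ and $\mathbf{c}$. By fixing the dragging matrix $\mathbf{E}$ and letting $\tilde{\mathbf{Y}} = \mathbf{Y} + \mathbf{B} \odot \mathbf{E}$, the optimization problem (23) is simplified as follows:
$$\min_{\mathbf{W}, \mathbf{c}} \ \|\mathbf{Z}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - \tilde{\mathbf{Y}}\|_F^2 + \lambda \|\mathbf{W}\|_F^2. \tag{24}$$
Similar to the solution of the optimization problem (15), the iterative equations for the optimization problem (24) with respect to $\mathbf{W}$ and $\mathbf{c}$ are as follows:
$$\mathbf{W} = (\mathbf{Z}\mathbf{H}\mathbf{Z}^T + \lambda \mathbf{I})^{-1}\mathbf{Z}\mathbf{H}\tilde{\mathbf{Y}}, \tag{25}$$
$$\mathbf{c} = \frac{1}{n}\left(\tilde{\mathbf{Y}}^T \mathbf{1}_n - \mathbf{W}^T \mathbf{Z} \mathbf{1}_n\right), \tag{26}$$
where $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$.
Then, update the dragging matrix $\mathbf{E}$. By fixing $\mathbf{W}$, $\mathbf{c}$ and letting the residual matrix be $\mathbf{R} = \mathbf{Z}^T \mathbf{W} + \mathbf{1}_n \mathbf{c}^T - \mathbf{Y}$, the optimization problem (23) reduces to
$$\min_{\mathbf{E}} \ \|\mathbf{R} - \mathbf{B} \odot \mathbf{E}\|_F^2, \quad \text{s.t.} \ \mathbf{E} \geq \mathbf{O}. \tag{27}$$
The solution to the optimization problem (27) can be obtained by the following equation:
$$\mathbf{E} = \max(\mathbf{B} \odot \mathbf{R}, \mathbf{O}). \tag{28}$$
Specifically, according to the definition of the Frobenius norm, solving the optimization problem (27) is equivalent to solving the following $n \times K$ subproblems:
$$\min_{\varepsilon_{ik}} \ (R_{ik} - B_{ik}\varepsilon_{ik})^2, \quad \text{s.t.} \ \varepsilon_{ik} \geq 0, \tag{29}$$
where $R_{ik}$ is the element in the $i$-th row and $k$-th column of the matrix $\mathbf{R}$.
Since $B_{ik}^2 = 1$, we have $(R_{ik} - B_{ik}\varepsilon_{ik})^2 = (B_{ik}R_{ik} - \varepsilon_{ik})^2$, so the solution to the optimization problem (29) is $\varepsilon_{ik} = \max(B_{ik}R_{ik}, 0)$. Thus, Equation (28) is the solution to the optimization problem (27).
The above solution process for the optimization problem (23) is summarized as Algorithm 1. After obtaining $\mathbf{A}_k$, $\mathbf{b}_k$, $c_k$, $k = 1, 2, \cdots, K$, by Algorithm 1, the corresponding decision function (10) can also be constructed.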
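The alternating scheme can be sketched in NumPy as below. This is an illustrative reading of the two update steps, not a reproduction of Algorithm 1: the initialization E = O, the stopping rule based on the decrease of the objective, the function name `fit_sqslsr`, and the random toy data are assumptions. The feature matrix Z (l x n) is taken as given, so any lifted representation of the samples can be passed in.

```python
import numpy as np

def fit_sqslsr(Z, Y, lam=1e-2, max_iter=100, tol=1e-10):
    """Alternate a closed-form update of (W, c) with a projection update of E."""
    l, n = Z.shape
    ones = np.ones(n)
    H = np.eye(n) - np.outer(ones, ones) / n        # centering matrix
    B = np.where(Y == 1, 1.0, -1.0)                 # dragging directions
    # The system matrix does not depend on E, so it is factored out once.
    P = np.linalg.solve(Z @ H @ Z.T + lam * np.eye(l), Z @ H)   # l x n
    E = np.zeros_like(Y)                            # assumed initialization
    history = []
    for _ in range(max_iter):
        T = Y + B * E                               # relaxed targets
        W = P @ T                                   # update W with E fixed
        c = (T.T @ ones - W.T @ (Z @ ones)) / n     # update c with E fixed
        R = Z.T @ W + c - Y                         # residual matrix
        E = np.maximum(B * R, 0.0)                  # update E with (W, c) fixed
        J = np.linalg.norm(R - B * E) ** 2 + lam * np.linalg.norm(W) ** 2
        history.append(J)
        # Assumed stopping rule: stop when the objective no longer decreases.
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
    return W, c, E, history

# Random toy problem (illustrative only): 5 features, 30 samples, 3 classes.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 30))
labels = rng.integers(1, 4, size=30)
Y = np.eye(3)[labels - 1]
W, c, E, history = fit_sqslsr(Z, Y)
print(len(history), history[0], history[-1])
```

Recording the objective value after each full sweep gives a simple numerical check that the iterates do not increase the objective.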

Discussion
In this section, we first discuss the convergence of Algorithm 1. Then, we discuss the computational complexities of HQSLSR and SQSLSR, respectively. Lastly, we analyze their interpretability.

Convergence Analysis
Since Algorithm 1 adopts an iterative method to solve the optimization problem (23), its convergence is discussed in this subsection.

Theorem 2.
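If the sequence of iterates $\{(\mathbf{W}_t, \mathbf{c}_t, \mathbf{E}_t)\}$ is generated by Algorithm 1, then the objective function $J_2(\mathbf{W}_t, \mathbf{c}_t, \mathbf{E}_t)$ of the optimization problem (23) is monotonically non-increasing.
Proof. First, let $t$ be the number of the current iteration. Then, we define the value of the objective function of the optimization problem (23) as $J_2(\mathbf{W}_t, \mathbf{c}_t, \mathbf{E}_t)$.
By the strong convexity of the optimization problem (24), given $\mathbf{E}_t$, $\mathbf{W}_{t+1}$ and $\mathbf{c}_{t+1}$ can be obtained from Equations (25) and (26), respectively, and satisfy the following inequality:
$$J_2(\mathbf{W}_{t+1}, \mathbf{c}_{t+1}, \mathbf{E}_t) \leq J_2(\mathbf{W}_t, \mathbf{c}_t, \mathbf{E}_t). \tag{30}$$
Then, fixing $\mathbf{W}_{t+1}$ and $\mathbf{c}_{t+1}$, $\mathbf{E}_{t+1}$ can be obtained from Equation (28), which yields the following inequality:
$$J_2(\mathbf{W}_{t+1}, \mathbf{c}_{t+1}, \mathbf{E}_{t+1}) \leq J_2(\mathbf{W}_{t+1}, \mathbf{c}_{t+1}, \mathbf{E}_t). \tag{31}$$
Combining the inequalities (30) and (31), we have the following inequality:
$$J_2(\mathbf{W}_{t+1}, \mathbf{c}_{t+1}, \mathbf{E}_{t+1}) \leq J_2(\mathbf{W}_t, \mathbf{c}_t, \mathbf{E}_t). \tag{32}$$
Since $J_2$ is bounded below by zero, the sequence of objective values converges. Thus, the proof is complete.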

Computational Complexity
In this subsection, we provide a detailed analysis of the computational complexities of our methods. Here, $n$, $d$, and $K$ represent the number of samples, features, and classes, respectively. From Definition 1, Definition 2, and Equation (12), it can be observed that our methods transform each sample from a $d$-dimensional space to an $l$-dimensional space, where $l = \frac{d^2+3d}{2}$. For simplicity, we ignore the computational cost of addition and subtraction.
The HQSLSR classifier is solved by Equations (16) and (17), which involve matrix inversion and multiplication. Therefore, the computational complexity of the HQSLSR classifier is about $O(l^3 + nl^2 + (n^2 + nK)l)$.
According to Algorithm 1, we briefly analyze the computational complexity of SQSLSR. The computational complexity of SQSLSR is mainly concentrated on steps 5, 8, 9, and 10 of Algorithm 1.
Step 5 involves matrix inversion and multiplication, and its computational complexity is $O(l^3 + nl^2 + n^2 l)$. Steps 8, 9, and 10 involve only matrix multiplication, so the computational complexity of each iteration is about $O(nKl + nK)$. In summary, the total computational complexity of SQSLSR is about $O(l^3 + nl^2 + n^2 l + t(nKl + nK))$, where $t$ is the number of iterations.
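To give a feel for how the lifted dimension l = (d^2 + 3d)/2 grows with d, the short sketch below evaluates l and the leading operation-count terms stated above for a few values of d. The helper names and the sample values of d are illustrative; the counts are order-of-magnitude estimates, not measured running times.

```python
def lifted_dim(d):
    """Dimension l = (d^2 + 3d) / 2 of the lifted feature space."""
    return d * (d + 3) // 2

def hqslsr_cost(n, d, K):
    """Leading terms l^3 + n l^2 + (n^2 + n K) l of the HQSLSR complexity."""
    l = lifted_dim(d)
    return l ** 3 + n * l ** 2 + (n ** 2 + n * K) * l

for d in (2, 10, 100, 784):
    print(d, lifted_dim(d), hqslsr_cost(n=1000, d=d, K=10))
```

The cubic term in l dominates quickly as d grows, which is why high-dimensional inputs are demanding for quadratic surface models.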

Interpretability
Although HQSLSR and SQSLSR are kernel-free, they can achieve the goal of nonlinear separation and retain interpretability. Therefore, we further elaborate on their interpretability.
Note that the decision functions of our methods are constructed from separating quadratic functions of the form
$$h(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A}\mathbf{x} + \mathbf{b}^T \mathbf{x} + c = \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} a_{ij} x_i x_j + \sum_{i=1}^{d} b_i x_i + c, \tag{33}$$
where $x_i$ is the $i$-th feature of the vector $\mathbf{x} \in \mathbb{R}^d$, $a_{ij}$ is the element in the $i$-th row and $j$-th column of the symmetric matrix $\mathbf{A} \in \mathbb{S}^d$, $b_i$ is the $i$-th component of the vector $\mathbf{b} \in \mathbb{R}^d$, and $c \in \mathbb{R}$. From the quadratic function (33), we can see that the values of $b_i$, $a_{ii}$, and $a_{ij}$ ($i \neq j$) determine the contributions of the first-order term and the second-order term of the $i$-th feature $x_i$, and of the cross term of $x_i$ and $x_j$, respectively. Roughly speaking, let
$$\theta_{i,h(\mathbf{x})} = |a_{ii}| + \sum_{j=1, j \neq i}^{d} |a_{ij}| + |b_i|;$$
the higher the value of $\theta_{i,h(\mathbf{x})}$, the more the $i$-th feature $x_i$ contributes to the quadratic function (33).
For the $K$ quadratic functions $f_k(\mathbf{x})$, $k = 1, \cdots, K$, in Equation (9), let $\theta_{i,k} = \theta_{i,f_k(\mathbf{x})}$ represent the contribution of the $i$-th feature to the $k$-th quadratic function $f_k(\mathbf{x})$, and let
$$\theta_i = \sum_{k=1}^{K} \theta_{i,k}.$$
The larger $\theta_i$ is, the more important the $i$-th feature is to the decision function (10). In particular, when $\theta_i = 0$, the $i$-th feature of $\mathbf{x}$ plays no role. Therefore, our methods have a certain interpretability.
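The feature-contribution score can be computed with a few lines of NumPy. The sketch assumes the score of feature i for one quadratic function is the sum of the absolute values of row i of A plus the absolute value of b_i, and that the overall score sums these over the K functions; the function name and the example coefficients are made up for illustration.

```python
import numpy as np

def feature_contributions(A_list, b_list):
    """Sum over the K functions of |a_ii| + sum_{j != i} |a_ij| + |b_i|."""
    theta = np.zeros(len(b_list[0]))
    for A, b in zip(A_list, b_list):
        # Row sums of |A| cover the diagonal and all cross terms of feature i.
        theta += np.abs(A).sum(axis=1) + np.abs(b)
    return theta

# Hypothetical coefficients for K = 2 functions on d = 3 features.
A1 = np.array([[1.0, 0.5, 0.0],
               [0.5, 2.0, 0.0],
               [0.0, 0.0, 0.0]])
b1 = np.array([0.3, 0.0, 0.0])
A2, b2 = -A1, -b1

theta = feature_contributions([A1, A2], [b1, b2])
print(theta)
```

In this toy example the third feature appears in no term of either function, so its score is zero and it has no influence on the decision.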

Numerical Experiments
In this section, we first implement our SQSLSR and HQSLSR on five artificial datasets to show their geometric meaning and compare them with LSR and DLSR. We also run our SQSLSR and HQSLSR on 16 UCI benchmark datasets and compare their accuracy with LSR, DLSR, LRDLSR, WCSDLSR, linear discriminant analysis (LDA), QSSVM, reg-LSDWPTSVM [22], SVM, and KRR. For convenience, SVMs with a linear kernel and an RBF kernel are denoted by SVM-L and SVM-R, respectively. KRRs with an RBF kernel and a polynomial kernel are denoted by KRR-R and KRR-P, respectively. Note that, on multi-class classification datasets, the SVM and QSSVM methods use the one-against-rest strategy [30]. We adopt five-fold cross-validation to select the parameters in these methods. The regularization parameters of SQSLSR and the other methods are selected from the set $\{2^{-8}, 2^{-7}, \cdots, 2^{8}\}$. The parameters of the RBF kernel and polynomial kernel are selected from the set $\{2^{-6}, 2^{-4}, \cdots, 2^{6}\}$. All numerical experiments are executed using MATLAB R2020b on a computer with a 2.80 GHz (I7-1165G7) CPU and 16 GB of available memory.

Experimental Results on Artificial Datasets
We construct five artificial datasets to demonstrate the geometric meaning of our methods and the advantage of the ε-dragging technique. Datasets I-IV are binary classification problems, where each dataset contains 300 points and each class has 150 points. Dataset V has three classes, and each class has 20 points. As the decision functions of our proposed HQSLSR and SQSLSR methods, as well as the comparison methods LSR and DLSR, are all composed of $K$ regression functions, we present $K$ pairs of regression curves $f_k(\mathbf{x}) = 0$ and $f_k(\mathbf{x}) = 1$, $k = 1, 2$, to display their classification results. Here, $f_k(\mathbf{x}) = 1$ is the regression curve of the $k$-th class, and $f_k(\mathbf{x}) = 0$ is the regression curve of the samples outside class $k$, $k = 1, 2$.
The first-class samples, $f_1(\mathbf{x}) = 1$, and $f_1(\mathbf{x}) = 0$ are indicated by the blue "+", the blue line, and the blue dotted line, respectively. The second-class samples, $f_2(\mathbf{x}) = 1$, and $f_2(\mathbf{x}) = 0$ are represented by the red "•", the red line, and the red dotted line, respectively. The accuracy of each method on each artificial dataset is shown in the top right corner.
The artificial dataset I is linearly separable. Figure 1 shows the results of the four methods, including LSR, DLSR, HQSLSR, and SQSLSR. It can be observed that $f_1(\mathbf{x}) = 1$ and $f_2(\mathbf{x}) = 0$ coincide; $f_2(\mathbf{x}) = 1$ and $f_1(\mathbf{x}) = 0$ coincide too. The samples of each class lie close to the corresponding regression curve and stay away from the regression curves of the other classes. In addition, the four methods correctly classify the samples on this linearly separable artificial dataset I.
As shown in Figure 2, the artificial dataset II includes some intersecting samples. Our methods outperform LSR and DLSR in terms of classification accuracy, because our HQSLSR and SQSLSR can obtain two pairs of regression curves, while LSR and DLSR can only obtain two pairs of straight regression lines. It is worth noting that the accuracy of SQSLSR is slightly higher than that of HQSLSR, because SQSLSR uses the ε-dragging technique to relax the binary labels into continuous real values, which enlarges the distances between different classes and improves the discrimination.
Figure 3 shows the visualization results of the artificial dataset III, which is sampled from two parabolas. Note that our HQSLSR and SQSLSR can obtain parabolic-type regression curves while LSR and DLSR can only obtain straight regression lines, so our methods are more suitable for this nonlinearly separable dataset.
The results of the artificial dataset IV are shown in Figure 4. The nonlinearly separable dataset IV is obtained by sampling from two concentric circles. Our HQSLSR and SQSLSR have higher accuracy for this classification task, as shown in Figure 4. From the first two subfigures, however, it can be seen that the samples of the two classes are far away from their respective regression curves, which explains the poor results of LSR and DLSR. Note that $f_1(\mathbf{x}) = 0$ and $f_2(\mathbf{x}) = 1$ coincide and lie at the center of the concentric circles, where they are not easy to observe. Thus, we only display $f_1(\mathbf{x}) = 0.1$ and $f_2(\mathbf{x}) = 0.9$, as shown in the last two subfigures.
We conducted experiments on the artificial dataset V to investigate the influence of the ε-dragging technique. The dataset consists of 60 samples from three classes, with 20 samples from each class arranged in three groups: left, middle, and right. By solving the optimization problems of HQSLSR (15) and SQSLSR (23) on dataset V, we obtained the corresponding regression labels $\tilde{\mathbf{f}}(\mathbf{x}) = (\tilde{f}_1(\mathbf{x}), \tilde{f}_2(\mathbf{x}), \tilde{f}_3(\mathbf{x}))^T$ and $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), f_2(\mathbf{x}), f_3(\mathbf{x}))^T$, where $\tilde{f}_k(\mathbf{x})$ and $f_k(\mathbf{x})$, $k = 1, 2, 3$, represent the three regression functions solved by HQSLSR and SQSLSR, respectively. The difference caused by the ε-dragging technique is represented by $\mathbf{D} = \mathbf{f}(\mathbf{x}) - \tilde{\mathbf{f}}(\mathbf{x})$, which includes three components related to the corresponding three classes. Figure 5 illustrates the relationship between the index of the training samples and the three components of the difference $\mathbf{D}$. According to Figure 5b, the first component of $\mathbf{D}$ is positive for the first 20 samples and negative for the last 40 samples. This observation suggests that the ε-dragging technique has effectively increased the gap in the first component between the first class and the remaining classes. Additionally, Figure 5c,d demonstrate that the second and third components of $\mathbf{D}$ highlight the second and third classes of samples, respectively. Therefore, the ε-dragging technique has successfully enlarged the differences in regression labels among samples from different classes, thereby enhancing the robustness of the model.
Based on the experimental results presented above, it can be concluded that the regression curve $f_k(\mathbf{x}) = 1$, $k = 1, 2, \cdots, K$, should be close to the samples from the $k$-th class while being distant from the samples of other classes. The $K$ pairs of regression curves can be modeled as arbitrary quadratic surfaces in the plane. This approach enables HQSLSR and its softened version (SQSLSR) to achieve higher accuracy. SQSLSR utilizes the ε-dragging technique to relax the labels, which forces the regression labels of different classes to move in opposite directions, thereby increasing the distances between classes. Consequently, SQSLSR exhibits better discriminative ability than HQSLSR.

Experimental Results on Benchmark Datasets
To validate the performance of our HQSLSR and SQSLSR, we compare them with the linear methods LSR, DLSR, LDA, SVM-L, LRDLSR, and WCSDLSR, and with the nonlinear methods QSSVM, SVM-R, KRR-R, KRR-P, and reg-LSDWPTSVM. These methods are implemented on 16 UCI benchmark datasets. Numerical results are obtained by repeating five-fold cross-validation five times and include average accuracy (Acc), standard deviation (Std), and computing time (Time). The best results are highlighted in boldface. Lastly, we also calculated the sensitivity and specificity of each method on six datasets to further evaluate their classification performance.
Table 1 summarizes the basic information about the 16 UCI benchmark datasets, which are taken from the website https://archive.ics.uci.edu/ml/index.php (the above datasets accessed on 18 August 2021). In Table 2, we show the experimental results of the above 13 methods on the 16 benchmark datasets. Table 2 shows that our HQSLSR and SQSLSR outperform the linear methods LSR, LDA, DLSR, LRDLSR, WCSDLSR, and SVM-L in terms of classification accuracy on almost all datasets. Moreover, the accuracy of our HQSLSR and SQSLSR is similar to that of the other nonlinear classification methods: SVM-R, SVM-P, KRR-R, KRR-P, QSSVM, and reg-LSDWPTSVM. Note that our SQSLSR has the highest classification accuracy on most datasets. In addition, in terms of computation time, our methods not only cost less time than the compared nonlinear methods but also have a narrow gap with the fastest linear method, LSR. In general, our HQSLSR and SQSLSR can achieve higher accuracy without increasing the time cost too much, and the generalization ability of SQSLSR in particular is better.
To further evaluate the classification performance of these 13 methods, we show their specificity and sensitivity on the datasets in Table 3. As Table 3 shows, our HQSLSR and SQSLSR perform well in terms of specificity and sensitivity on most of the benchmark datasets.

Statistical Analysis
In this subsection, we use the Friedman test [31] and the Nemenyi test [32] to further illustrate the differences between our two methods and other methods.
First, we carry out the Friedman test, where the null hypothesis is that all methods have the same classification accuracy and computation time. We ranked these 13 methods based on their accuracy and computation time on the 16 benchmark datasets and present the average rank $r_i$ ($i = 1, 2, \cdots, 13$) for each algorithm in Tables 4 and 5. Let $N$ and $s$ denote the number of datasets and algorithms, respectively. The relevant statistics are obtained by
$$\tau_{\chi^2} = \frac{12N}{s(s+1)}\left(\sum_{i=1}^{s} r_i^2 - \frac{s(s+1)^2}{4}\right), \quad \tau_F = \frac{(N-1)\tau_{\chi^2}}{N(s-1) - \tau_{\chi^2}}, \tag{35}$$
where $\tau_F$ follows an $F$-distribution with degrees of freedom $s - 1$ and $(s-1)(N-1)$. According to Equation (35), we obtain two Friedman statistics $\tau_F$, which are 12.6243 and 109.9785, and the critical value corresponding to $\alpha = 0.05$ is $F_\alpha = 1.8063$. Since $\tau_F > F_\alpha$, we reject the null hypothesis.
Rejection of the null hypothesis suggests that our HQSLSR, SQSLSR, and other methods perform differently in terms of accuracy and computation time. To further distinguish these methods in terms of classification accuracy and computation time, a Nemenyi test is adopted, and the critical difference is calculated with the following equation:
$$CD = q_\alpha \sqrt{\frac{s(s+1)}{6N}}. \tag{36}$$
When $\alpha = 0.05$, $q_\alpha = 3.313$, and we obtain $CD = 4.5616$ by Equation (36). Figures 7 and 8 visually display the results of the Friedman test and the Nemenyi post hoc test. The average rank of each method is marked along the axis. Groups of methods that are not significantly different are connected by red lines.
On the one hand, our methods HQSLSR and SQSLSR are not very different from SVM-R, KRR-R, and KRR-P and are significantly better than LSR, DLSR, LDA, SVM-L, and QSSVM in terms of classification accuracy. On the other hand, our methods HQSLSR and SQSLSR are not very different from LSR, DLSR, and LDA and are significantly better than WCSDLSR, KRR-R, KRR-P, SVM-L, reg-LSDWPTSVM, SVM-R, and QSSVM in terms of computation time. In general, our HQSLSR and SQSLSR can achieve higher accuracy while maintaining relatively small computation time.

Conclusions
In this paper, utilizing the kernel-free trick and ε-dragging technique, we propose two classifiers, HQSLSR and its softened version (SQSLSR). On the one hand, the quadratic surface kernel-free trick is introduced, which avoids the difficulty of selecting the appropriate kernel functions and corresponding parameters while maintaining good interpretability. On the other hand, utilizing the ε-dragging technique makes the labels more flexible and enhances the generalization ability of SQSLSR. Our HQSLSR can be solved directly, while SQSLSR is solved by an alternating iteration algorithm which we designed. Additionally, the computational complexity, convergence analysis, and interpretability of our methods are also addressed. The experimental results on artificial and benchmark datasets confirm the feasibility and effectiveness of our proposed methods.
In future work, we aim to address several challenges to extend the HQSLSR and SQSLSR models. Specifically, we plan to simplify the quadratic surface to enable our approaches to process high-dimensional data, such as image data. Moreover, we intend to incorporate suitable sparse regularization terms to achieve feature selection.
Funding: This work is supported by the National Natural Science Foundation of China (No. 12061071).